The available data comes from an online shop, which sells beauty products. There are two datasets given: One with data about customer orders and another with data about customer clicks on the website.
The orders dataset consists of one row for each customer order in the time periode from the 28. January 2000 to the 3. March 2000. The data contains values such as the number of product units ordered, the total order amount, payment information, the manufacturer and brand names of the ordered products and social data about the customer, as by example his location or age.
The clicks dataset contains data referring to customer clicks on the website of the given company. Therefore, it is composed of data such as the timestamp of a click, the session length and the name of the clicked product, as well as the product category it belongs to. In this way the whole session course of a customer can be illustrated through the data.
Both datasets share a considerable amount of columns. However, since not every click results in an order and since a session consists normally of more than one click, the contents differ significantly.
The project’s task is to analyze the dataset, especially by creating plots and statistical tables for the data, that is suspected to be relevant for the online shop in some way.
In addition to that the performance of different Recomendation systems has to be analyzed.
Finally, a model has to be constructed to predict whether or not a user of the web-shop will order something, based on his browsing behaviour on the site.
Before we started cleaning the data, we copied it to the folder “0 Data”. The reason for this was to avoid accidentally altering the original dataset by separating the edited datasets from the original ones. The cleaned data and further forms of the datasets were also saved to this folder. Before explaining the cleaning process, it makes sense to get an overview of the content of the aforementioned folder. It can be described as follows:
The cleaning process consists of the following steps:
Copy the original datasets to the folder “0 Data”
Read the data, add headers (i.e. column labels), replace “?” and “NULL” with NA, drop columns which have a 100% ratio of missing data, reformat datetime cells and save the result (e.g. as “order_data_cleaned_R.csv”)
Create a subset of the cleaned data, containing only 1000 rows, and save it (e.g. as “order_data_small_R.csv”)
A similar cleaning process to the one explained above has been implemented in Python.
Note: Python coding chunks are excutable in RMarkdown in general, but the Python environment is not persistent across different python chunks for the preview function ro run coding. Despite this, the chunks are compiled together, when the document is knitted. To execute the provided Python code copy it into the Python IDE of your choice, ideally under the recommended location, which is mentioned at the top of code block. The path is always relative to the repository.
To test if the cleaning scripts in Python and R result in the same file, execute the following code in RStudio. The package creates a diff view
library(diffobj)
pathOrders <- "0 Data/order_data_cleaned_R.csv"
pathClicks <- "0 Data/clickstream_data_cleaned_R.csv"
pathOrdersPython <- "0 Data/order_data_cleaned_P.csv"
pathClicksPython <- "0 Data/clickstream_data_cleaned_P.csv"
orders <- read.csv(file=pathOrders)
clicks <- read.csv(file=pathClicks)
clicksP = read.csv(file=pathClicksPython,na.strings=c("","NA"))
diffPrint(target=clicksP,current=clicks)
ordersP = read.csv(file=pathOrdersPython,na.strings=c("","NA"))
diffPrint(target=ordersP,current=orders)
Note: Since we do not use the merged datasets for any in-depth analysis we consider it to be a part of “Data exploration”, instead of “Data cleaning”. Thus we implemented it in Python only.
Furthermore, we tried to merge the click and order data in Python by trying different ID combinations that occure in both datasets. For testing the different combinations we used an inner join in order to be able to recognize easier whether a merging try had success. We tried the following combinations for merging the two datasets, which resulted in the shown shapes for the merged dataset:
| Clicks | Orders | Shape |
|---|---|---|
| Session ID | Order Line Session ID | [0, 438] |
| Session ID | Order Session ID | [0, 438] |
| Customer ID | Customer ID | [6906, 437] |
| Session Cookie ID | Order Line Session ID | [0, 438] |
| Session Cookie ID | Order Session ID | [0, 438] |
In this way we were able to discover that it is possible to join on the ‘Customer ID’ for some instances. Thus, we saved a dataset for the merging results on the Customer ID. But the merged data makes rather less sense since a Customer ID can have multiple order and click rows. Because of this issue, we decided on building a second, smaller data subset containing only the customer information columns of both original datasets. The final merged customer dataset contains 80 attributes for 97 customers.
The aim of the data analysis is to extract information, which is suspected to be valuable to the online shop, and prepare it in a way that makes it easily “digestible”. The overview of the information is presented in statistical tables and plots.
Before creating overview tables or plots for columns, it makes sense to look which columns actually contain a large quantity of information and which do not. To do a check up on the ratio of filled cells, we created a ranking for both datasets containing column names and the percentage of missing data for each column. Columns with a low percentage of missing data are then preferred in later analysis steps. The first 50 entries in the resulting rankings can be seen in the following two tables (left: Orders, right: Clicks).
|
|
|
The order data can be mainly devided into 4 sections:
Customer Data: For this section we regard all information referring to the customer as an individual. This data contains information such as the customer gender, location, family status and retail activities.
Product Data: The following data columns describe features of the ordered products.
Payment Data: This sections contains columns which describe the payment methods used by the customers.
Order Data: The order data section contains information describing the order process itself, such as order quantity and price data.
The clickstream data has three main categories:
Customer Data: The data contains a vast collection of information about customers, reaching from usual informations like age, gender, etc. over payment information and finacnial activities to opinions about the shop.
Product Data: The following data columns describe features of products clicked on by the customers.
Time Data: The following data columns describe different dates and times for clicks.
Given a subset of interesting columns, we create two types of statistical tables for each: One table for numerical columns in the subset and another for factors. The statistical table for the numerical data contains the maximum value, minimum value, mean, median and standard deviation for each column. Whereas the factorial tables contain the five most frequent factors as well as their percentage, the ratio of NAs and other factors for each column. For this, the NA percentage gets calculated at first, then the NA values are deleted from the regarded column and the percentage for each value is calcuated.
In the following section the statistical tables generated for the purpose of describing the order data are shown. Additionally, the most important or interesting analysis results are emphasized and shortly explained.
Note: To support a better visualization, more relevant columns are highligthed in black color.
Only the age can be regarded as a numerical customer data column here. As most intersting in this summary might appear the mean and the median, which both imply an average customer segment consisting of people in their late 30s.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Age | 98 | 38.37 | 36 | 18 | 10.87 |
The following data summary shows some social data for the shop’s customer segment. Since all of the available data for the country column contains the value ‘United States’, it is highly probable that the online shop exclusively delivers customer located in the US. This was the reason for us to choose a map of the United States in order to visualize the customers’ locations later on in the plotting. Furthermore, the data clearly shows that the main customer audience targeted are women.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| City | New York: 4.53% | San Francisco: 2.05% | Stamford: 1.24% | Austin: 1.13% | Brooklyn: 0.98% | 90.07% | 0% |
| Country | United States: 100% | 0% | 2.83% | ||||
| US.State | CA: 14.63% | NY: 14.11% | TX: 6.93% | PA: 5.8% | CT: 5.28% | 53.25% | 0% |
| Marital.Status | Married: 66.13% | Single: 22.02% | Inferred Single: 7.15% | Inferred Married: 4.7% | 0% | 34.98% | |
| Gender | Female: 83.06% | Male: 16.94% | 0% | 44.96% | |||
| Audience | Women: 81.17% | Men: 12.5% | Children: 6.33% | 0% | 11.08% | ||
| Truck.Owner | False: 78.55% | True: 21.45% | 0% | 22.22% | |||
| RV.Owner | False: 91.5% | True: 8.5% | 0% | 22.22% | |||
| Motorcycle.Owner | False: 98.66% | True: 1.34% | 0% | 22.22% | |||
| Working.Woman | False: 68.79% | True: 31.21% | 0% | 22.22% | |||
| Presence.Of.Children | False: 54.66% | True: 45.34% | 0% | 22.22% | |||
| Speciality.Store.Retail | False: 84.12% | True: 15.88% | 0% | 22.22% | |||
| Oil.Retail.Activity | False: 91.8% | True: 8.2% | 0% | 22.22% | |||
| Bank.Retail.Activity | False: 75.44% | True: 24.56% | 0% | 22.22% | |||
| Finance.Retail.Activity | False: 91.69% | True: 8.31% | 0% | 22.22% | |||
| Miscellaneous.Retail.Activity | False: 94.88% | True: 5.12% | 0% | 22.22% | |||
| Upscale.Retail | False: 94.25% | True: 5.75% | 0% | 22.22% | |||
| Upscale.Speciality.Retail | False: 96.44% | True: 3.56% | 0% | 22.22% | |||
| Retail.Activity | False: 60% | True: 40% | 0% | 22.22% |
The selected columns belonging to the product information section show only factorial values. The statistical overview for the product data reveals that most of the sold articles are replenishable. The strongest brand in the current orders is American Essential, which seems to manufacture its articles by itself. Important is to recognize that the presented data is biased in a way: Because the given dataset shows only ordered products, it can be assumed that the popularity of articles distorts all percentual information referring to the products. Thus, we can not make any assumptions referring to the original product palette the store offers.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| StockType | Replenishable: 69.58% | Seasonal 1: 23.08% | Replenishment: 5.14% | Seasonal 1*: 1.83% | Seasonal 2: 0.24% | 0% | 14.72% |
| Manufacturer | American Essentials: 20.91% | Ridgeview: 16.64% | HAN: 13.16% | Donna Karan Company: 10.85% | HOSO: 10.67% | Peneco: 8.98% | 0% |
| BrandName | AME: 22.07% | HOSO: 11.26% | ELT: 10.81% | Silk Reflections: 9.35% | DAN: 7.92% | 27.51% | 11.08% |
When it comes to the data concerning the used payment methods, there are only factorial columns as well. The most used credit card is by far the VISA card. Furthermore almost a fifth of the customers uses a premium card. From this information it could be deduced how wealthy the customer segment is by comparing the ratio of premium cards to the one in the whole population.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Order.Credit.Card.Brand | VISA: 59.94% | MC: 25.43% | AMEX: 14.31% | DISC: 0.31% | NA | 0% | 16.71% |
| Bank.Card.Holder | True: 86.57% | False: 13.43% | NA | 0% | 22.22% | ||
| Gas.Card.Holder | True: 75.81% | False: 24.19% | NA | 0% | 22.22% | ||
| Upscale.Card.Holder | True: 54.1% | False: 45.9% | NA | 0% | 22.22% | ||
| Unknown.Card.Type | False: 56.18% | True: 43.82% | NA | 0% | 22.22% | ||
| TE.Card.Holder | False: 89.42% | True: 10.58% | NA | 0% | 22.22% | ||
| Premium.Card.Holder | False: 75.88% | True: 24.12% | NA | 0% | 22.22% | ||
| New.Bank.Card | False: 99.55% | True: 0.45% | NA | 0% | 22.22% |
The numerical data for the order process shows that a customer usally buys one product per order. Also the order line amount implies that the store offers rather inexpensive articles. Furthermore, the minimum value for both the order line quantity and the order line amount is negative, which hints to the assumption of eather the order data containing returns as well or having errors in it. Through the discount amount it is possible to state that the store offers a maximum of a 50% price reduction for the given time period. Also the buyed articles got by average a discount of about 9%. Again it can be assumed that this data is not representative for the shop’s offer in general, because it is probable that articles with a higher discount are bought more often.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Order.Line.Quantity | 18 | 1.31 | 1.0 | -2 | 0.95 |
| Order.Line.Unit.List.Price | 72 | 9.26 | 7.5 | 0 | 6.46 |
| Order.Line.Amount | 234 | 11.62 | 10.0 | -40 | 11.51 |
| Order.Line.Hour.of.Day | 23 | 13.04 | 13.0 | 0 | 5.29 |
| Order.Discount.Amount | 50 | 8.82 | 10.0 | 0 | 9.98 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Spend.Over..12.Per.Order.On.Average | False: 64.04% | True: 35.96% | 0% | 0% | |||
| Order.Line.Day.of.Week | Wednesday: 26.96% | Tuesday: 17.72% | Thursday: 17.37% | Friday: 16.48% | Saturday: 8.43% | 13.04% | 0% |
| Order.Promotion.Code | FRIEND: 82.09% | SPRING: 2.14% | MARCH1: 1.92% | FREE: 1.39% | 4128003160593466: 1.13% | 0% | 23.15% |
The following tables resemble only the most important columns from the clickstream dataset. Interesting details are discussed in short texts. For the full range of columns see the Appendix.
The age is very similar to that of the Order Data, averaging at 37.58, with a median of 36. This shows that buyers do not differ to viewers in respect to their age.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Age | 86 | 37.58 | 36 | 18 | 10.71 |
The same can be said in regards to the location compared to the Order Data, in that there are only minor differences. Additionally, the gender distribution is very similar, heavily pointing out the shop’s main target group are women.
For all details see the full customer data table.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| City | San Francisco: 2.31% | New York: 2.18% | Chicago: 1.35% | Stamford: 1.09% | Dallas: 0.71% | 0% | 96.92% |
| US.State | CA: 13.28% | NY: 11.1% | TX: 5.97% | PA: 5.39% | IL: 4.68% | 0% | 96.92% |
| Marital.Status | Married: 61.8% | Single: 24.7% | Inferred Married: 7% | Inferred Single: 6.5% | 0% | 98.02% | |
| Gender | Female: 83.33% | Male: 16.67% | 0% | 98.43% | |||
| Audience | Women: 85.44% | Children: 10.27% | Men: 4.29% | 0% | 98.25% | ||
| Truck.Owner | False: 77.84% | True: 22.16% | 0% | 97.72% | |||
| RV.Owner | False: 91.17% | True: 8.83% | 0% | 97.72% | |||
| Motorcycle.Owner | False: 98.79% | True: 1.21% | 0% | 97.72% | |||
| Working.Woman | False: 65.54% | True: 34.46% | 0% | 97.72% | |||
| Presence.Of.Children | False: 50.39% | True: 49.61% | 0% | 97.72% | |||
| Speciality.Store.Retail | False: 84.59% | True: 15.41% | 0% | 97.72% | |||
| Oil.Retail.Activity | False: 90.65% | True: 9.35% | 0% | 97.72% | |||
| Bank.Retail.Activity | False: 77.58% | True: 22.42% | 0% | 97.72% | |||
| Finance.Retail.Activity | False: 90.3% | True: 9.7% | 0% | 97.72% | |||
| Miscellaneous.Retail.Activity | False: 94.37% | True: 5.63% | 0% | 97.72% | |||
| Upscale.Retail | False: 94.37% | True: 5.63% | 0% | 97.72% | |||
| Upscale.Speciality.Retail | False: 96.36% | True: 3.64% | 0% | 97.72% | |||
| Retail.Activity | False: 64.33% | True: 35.67% | 0% | 97.72% |
In the clickstream dataset most of the products are only viewed and not bought. That leads to some interesting distinctions to the Order Product Data. Seasonal 1* is 10% higher than in the Order Product Data and decreasing the replenishable part with the same amount. Donna Karan is leading the top manufacturer and the top brands, but is closely followed by its competitors. However, when compared to the Order Product Data this company is only in the 4th spot and its product not at all in the top 5, leading to the assumption that customers do not inherently buy that what is most viewed.
For all details see the full products data table.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| StockType | Replenishable: 60.93% | Seasonal 1: 23.75% | Seasonal 1*: 11.25% | Seasonal 2: 2.07% | Replenishment: 1.83% | 0% | 80.3% |
| Manufacturer | Donna Karan Company: 10.79% | Peneco: 9.08% | HAN: 8.66% | Kneipp: 6.61% | Paul Lavitt Mills Inc.: 6.54% | 0% | 80.12% |
| BrandName | DKNY: 9.99% | Silk Reflections: 9.21% | ORO: 8.85% | HPK: 7.22% | AME: 7.13% | 0% | 86.16% |
The payment data from the clickstream set is not as reliable, as this kind of information should only be available when purchases were made and so the data should be quite similar to the Order payment data. Interestingly the data sets differ more than expected, most importantly in the amount of premium cards which is lower by about 5% in the clickstream set, and new credit cards are reported at all.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| Bank.Card.Holder | True: 81.44% | False: 18.56% | NA | NA | NA | 0% | 89.44% |
| Gas.Card.Holder | True: 68.95% | False: 31.05% | NA | NA | NA | 0% | 89.44% |
| Upscale.Card.Holder | False: 50.88% | True: 49.12% | NA | NA | NA | 0% | 89.44% |
| Unknown.Card.Type | False: 59.71% | True: 40.29% | NA | NA | NA | 0% | 89.44% |
| TE.Card.Holder | False: 90.67% | True: 9.33% | NA | NA | NA | 0% | 89.44% |
| Premium.Card.Holder | False: 79.92% | True: 20.08% | NA | NA | NA | 0% | 89.44% |
| New.Bank.Card | False: 100% | NA | NA | NA | 0% | 89.44% |
Since not all interesting data can be recorded through simply summaring the existing columns, we decided to add calculated columns. The session duration gives the time in minutes a customer spend on the website, which gets calculated through the timestamps on the first and last click of a session. The click number represents how much clicks a customer made during one session. For both these values, the max values indicate that there could be bot activity on the website.
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Session.Duration | 1439.983 | 8.88 | 2.55 | 0 | 74.34 |
| Click.Number | 1842.000 | 46.93 | 4.00 | 1 | 189.84 |
Some information is too complex to be compressed into a single table without making it too confusing or it’s simply easier to understand if presented as a plot. The plot types used are time series plots, stacked bar plots, distribution curves, lorenz curves and maps.
The customer data of the order data can be viewed from two perspectives: One way is to use every single order row for the creation of the visualizations and therby create a weighted view on the data, in which customers that have bought more products are more respected. Another possibility is to display the customer information just for every unique customer in the order data and disregard the number of orders a customer made. The following plottings will show both of this perspectives.
Firstly, we genereted some density curves for the attribute age to visualize the distribution of the customership. Also, we differentiated between the genders for this. The curves all show a fast rise in customership for the ages 20 to 40, which decreases slowly after a peak at about 35 to 40. When comparing the weighted and the normal graph, a slightly shift of the curve can be observed. This indicates that in general older people tend to order more. This is especially relevant for an older male customership, which is shown by the peak at about 55 in the male curve of the weighted density plot. Additionally, a less wide peak can be observed for the male customers, which has its maximum shortly before the age of 40. Furthermore, it should be regarded, that the gender plot shows a prozentual curve for each gender, but the ratio of customers differs by gender.
In order to show the customer location, we generated a map of the United States, which shows the cities with the highest customer numbers. From this it can be observed that the west coast of the US orders in general the most. The difference between the weighted and the other customer graph shows that the customership from the large cities like New York or San Franscisco tends to have high order numbers since the large circles at this cities disappear, when we look at the non-weighted graph.
To visualize the importance of different areas, we created a heatmap for the different US States. Here we can see that California and New York are the most important customer states. This is probably highly influenced by the big cities in this states. Here, the distorting factor of population density must be taken into account.
For the product Data, it has to be regarded that we here visualize the orders and thereby the characteristics of more often bought products have a higher influence. Because of this, the following plots should be seen as popularity graphs of different product attributes.
The following stacked bar plot shows the stock types for the top brands and manufactures. Most of them have a replenishable assortment. The biggest brands and manufacturers seem to have mixed stock types, which contain partly seasonal products.
The lorenz curve of the products shows that only a quarter of the whole product quantity is responsible for about 75% of the ordered products. This indicates that some products have a very high popularity.
The manufacturer lorenz curve shows a real high ratio on orders for the biggest manufacturer, which leads to the assumptions that the popularity of some manufacturers is even higher than the product popularity.
The brand lorenz curve looks a little more flat, indicating a higher relevance for the manufacturer than for the brand.
The stacked bar plot for credit card brands shows the ratio of each brand on premium and on upscale cards. The card brand AMEX is here noticable with a relative high ratio of premium and upscale cards.
For the order process data some graphs referring to the order amount and price were created. It has to be respected again that the order data can deform the graphs, by example it is imaginable that the customers tend to buy the cheaper products and therefore the average product price seems lower.
The density curve for the discount amount shows 3 peaks: The first one has a medium height and is around a discount of 0%, the second one is around the 10% mark and is pretty large, whereas the last peak is at a discount of 50% and is rather low. This might indicate that customers buy rather targeted than randomly.
A density curve reflecting the order time is shown by the next plot. As to be expected the order amount goes down through the night. During the day the ordering is relative stable with some small peaks at 10 a.m. and in the afternoon. The activity in the afternoon can be explained by the average working hours, which mostly allow people only to spend time on online shopping at the afternoon and evening.
Next, there is displayed a graph, which summarizes some information on order behaviour. From this visulalization we can learn that the online shop sells products of a low price segments, but usually receives orders contaning a rather high amount (peak at 3-15) of articles. The different history plots on the right indicate a high activity in the first half of February, which drops in the beginning of the second month half and than slowly rises again. The first half could possibly be explained through customers buying presents for the valentines day (14th February).
Especially interesting for analysing purposes is the time component that comes with the clickstream dataset. Each clickstream consists of timestamps and a number in a sequence.
Most of the licks are accumulated in the morning and then descending over day. The extreme spike at around 2 in the morning can be neglected as it seems to be bot activity with extremly long session times and sequences.
The above mentioned behavior is also visible in this graph. Overall there is more activity during the middle of months and less towards the end. Also, the seasonal component of the clickstream data set is visible very well.
While the session length on average is about 8 min, the median only is at about 2 min (see table). This is visible in the graph where the high point is also at about 2 min indicating the skewness of the time data. Most sessions last only around 10 min, which leads to the assumption that products should be displayed well as the customers do not waste much time on searching for them.
In addition to analyzing the order and clickstream data, we have to analyze the performance of different recommendation models. The performance of three different recommendation systems was measured:
The evaluation of the profit and ranking based recommendation systems is done using inference analysis, specifically using the computational paradigm instead of the mathematical one. One test for each of the two recommendation systems is carried out with the null hypothesis always being that the system does not cause different sales than a purely random recommendation system. During each test we randomize our sample data 1000 times, using either permutation or bootstrapping, and measure the p-value and confidence interval. The test statistic, we always use, is the difference in mean between the group using the profit or ranking based Recomendation system and the group using the random recommendation system. If the null hypothesis was true, then the test statistic value for our sample would not significantly differ from the distribution of the test statistic for our randomized data. Our default alpha for the confidence interval is 5%, but since we conduct a total of two tests, we have to apply the Bonferroni correction and adjust the alpha we specify for our confidence interval to 2.5%.
Before diving into the inference analysis itself, we have to reformat our data for the recommendation systems into a form that is suitable for inference analysis. We want to have a data frame in which one row equals one customer, who was exposed to a recommendation system. We use three columns:
If the value in column Used_Top_recommendations and Used_Profit_Oriented_recommendations is 0, it means that the random recommendation system was used.
In the following code block we reformat the data and print the first 10 rows of the reformatted table.
| Sales_in_EUR | Used_Profit_Oriented_recommendations | Used_Top_recommendations |
|---|---|---|
| 8.50 | 0 | 0 |
| 20.00 | 1 | 0 |
| 16.00 | 0 | 1 |
| 8.50 | 0 | 0 |
| 17.75 | 1 | 0 |
| 19.75 | 0 | 1 |
| 18.75 | 0 | 0 |
| 21.75 | 1 | 0 |
| 19.75 | 0 | 1 |
| 17.75 | 0 | 0 |
Now, we do an inference analysis for the profit oriented recommendation system. Firstly, we look at the p-value and the corresponding plot:
## [1] "p-value = 0"
The plot shows us the distribution of the test statistic for the 1000 randomized samples. The test statistic value for our sample is represented by a black line. The two-sided p-value regions are marked by a grey background. If our null hypothesis were true, then the test statistic value of our sample would be somewhere in the distribution of the test statistic for the randomized samples. Every test statistic value for a randomized sample, which lies in the p-value region, increases the p-value.
As we can see the test statistic value of our sample is pretty far away from the test statistic values of the randomized samples. This already shows, without looking at the p-value itself, that the profit oriented recommendation system causes a significant difference in the sales in Euro. The value of the p-value is 0, which reaffirms our interpretation of the plot.
Now, we look at the confidence interval:
| 2.5% | 97.5% |
|---|---|
| 3.275654 | 4.399628 |
As we can see, there is a 95% chance that for the global population customers, who get profit oriented recommendations, spend on average 3.28€-4.4€ more than people who get random recommendations.
Now we do an inference analysis for the ranking based recommendation system. Firstly, we look at the p-value and the corresponding plot:
## [1] "p-value = 0.08"
In this plot quite some portions of the distribution of the test statistic for our random samples lie in the p-value zone. This is also shown by the p-value 0.08, which greater than 0.025. This means that for our alpha = 0.05 the effect of the ranking based recommendation system is statistically insignificant.
Now we look at the confidence interval:
| 2.5% | 97.5% |
|---|---|
| -0.0922301 | 1.263623 |
Since the confidence interval includes the value 0, it shows us that the effect is statistically insignificant.
To sum it up, the company should use the profit oriented recommendation system, since out of the two tested systems it causes the largest increase in sales. The ranking based recommendation system does not cause any statistically relevant difference in sales. However if the effect was something else than sales in Euro per person, then the results could be different. The company should ask itself if the goal of their recommendation system should really be to increase sales. Maybe at some point in time it could introduce a subscription business model, similar to that of Amazon. In that case it might also want to increase the percentage of customers that have a subscription.
Following the inference analysis our next task is to predict whether or not a person browsing the online-shop will end up purchasing something, based on the browsing behaviour on the site.
First of all the relevant data had to be extracted, or engineered, from the clickstream data and put into a seperate dataset, to be used as the input data for a prediction model. The resulting data set contains one row per session. For each session the following attributes are used:
Since the last two attributes have to be one-hot-encoded, it is not possible to show the head of the dataset in this report. However an important thing to note is that about 99% of sessions do not contain an order. In addition to that, sessions consisting of only one click are ignored, since they offer no valuable information.
After preparing the data the next step is training and testing a model. The model of choice is a decision tree, since it allows to easily understand the decision making process of the model, and since it removes the need for dimensionality reduction and feature scaling, thus reducing the required work. The evaluation metric used during the Grid Search is roc-score. However, depending on the evolving needs of the company, one might want to chose a different evaluation metric.
Since the training set is created by using random samples, model performance can vary during each run. However, on average the accuracy should be around 90%, the F1 score at around 93% and the ROC score at around 95%. Also, on average the most important features are the number of clicks and the session duration.
The confusion matrix can be seen below. As you can see, the model classifies around 95% of sessions, which contain an order, correctly. However, only around 89% of instances, which do not contain an order, are classified correctly. This brings up the following question: How should each classification type be weighted? To answer this the company should first ask itself “How, if at all, do we want to target session users differently?”. For example, one idea could be to offer discounts to visitors, which have a high click rate during short sessions, as this could imply unsatisfaction with the price or quality of the visited products.
Decision Tree - Confusion Matrix
As mentioned above the main reason for choosing a decision tree is the ease of understanding its logic and the automatic feature filtering. Below you can see a visualization of the decision tree. It has three layers. The only variables of interest are the session duration and the number of clicks during the session. As you can see, whenever a session has over 10.5 clicks and lasts for longer than 141 seconds the tree predicts that it contains an order.
Decision Tree Vizualized
It should be noted that if the company prefers a model with a higher performance and less traceability, other models should be considered as well. Among them are Logistic Regression, Random Forest and Neural Networks. Methods for automatic dimensionality reduction, such as PCA, or for iterative model improvements, such as Bagging or Boosting, should also be considered in that case.
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| WhichDoYouWearMostFrequent | casual socks: 34.62% | hosiery: 31.14% | athletic socks: 19.15% | trouser socks: 15.09% | 0% | 98.98% | |
| YourFavoriteLegcareBrand | Conair: 15.4% | Nature Made: 12.79% | Epilady: 10.7% | eShave: 7.31% | Lucky Chick: 6.53% | 0% | 99.24% |
| Registration.Gender | Female: 52.17% | Male: 47.83% | 0% | 99.95% | |||
| NumberOfChildren | 0: 60.87% | 1: 13.04% | 2: 13.04% | 4 or more: 13.04% | 0% | 99.95% | |
| DoYouPurchaseForOthers | False: 100% | 0% | 96.96% | ||||
| HowDoYouDressForWork | business casual: 41.57% | very casual: 29.22% | business dress: 15.29% | comfortable / athletic: 13.92% | 0% | 98.99% | |
| HowManyPairsDoYouPurchase | 15 or more: 48.9% | 11 to 15: 30.94% | 1 to 5: 14.37% | 6 to 10: 5.79% | 0% | 99.01% | |
| YourFavoriteLegwearBrand | Hanes: 39.14% | DKNY: 17.22% | Donna Karan: 8.61% | Danskin: 8.22% | Berkshire: 5.09% | 0% | 98.99% |
| WhoMakesPurchasesForYou | spouse: 50.51% | friend: 24.24% | parent: 23.23% | siblings: 2.02% | 0% | 99.8% | |
| NumberOfAdults | 2: 43.48% | 3 or more: 34.78% | 1: 21.74% | 0% | 99.95% | ||
| HowDidYouHearAboutUs | other: 38.32% | friend / family: 31.99% | e-mail: 18.15% | print ad: 9.23% | direct mail: 1.19% | 0% | 97.34% |
| SendEmail | True: 65.92% | False: 34.08% | 0% | 96.92% | |||
| HowOftenDoYouPurchase | every 6 months: 75.84% | once a year: 16.62% | each week: 7.53% | 0% | 99.24% | ||
| HowDidYouFindUs | Friend/Co-worker: 69.57% | Other: 26.09% | News Story: 4.35% | 0% | 99.95% | ||
| City | San Francisco: 2.31% | New York: 2.18% | Chicago: 1.35% | Stamford: 1.09% | Dallas: 0.71% | 0% | 96.92% |
| US.State | CA: 13.28% | NY: 11.1% | TX: 5.97% | PA: 5.39% | IL: 4.68% | 0% | 96.92% |
| COM: 72.1% | NET: 19.69% | Gazelle: 2.95% | EDU: 2.57% | Other: 2.12% | 0% | 96.92% | |
| Truck.Owner | False: 77.84% | True: 22.16% | 0% | 97.72% | |||
| RV.Owner | False: 91.17% | True: 8.83% | 0% | 97.72% | |||
| Motorcycle.Owner | False: 98.79% | True: 1.21% | 0% | 97.72% | |||
| Marital.Status | Married: 61.8% | Single: 24.7% | Inferred Married: 7% | Inferred Single: 6.5% | 0% | 98.02% | |
| Working.Woman | False: 65.54% | True: 34.46% | 0% | 97.72% | |||
| Mail.Responder | True: 76.54% | False: 23.46% | 0% | 97.72% | |||
| Bank.Card.Holder | True: 83.55% | False: 16.45% | 0% | 97.72% | |||
| Gas.Card.Holder | True: 72.73% | False: 27.27% | 0% | 97.72% | |||
| Upscale.Card.Holder | True: 51.08% | False: 48.92% | 0% | 97.72% | |||
| Unknown.Card.Type | False: 60.17% | True: 39.83% | 0% | 97.72% | |||
| TE.Card.Holder | False: 91% | True: 9% | 0% | 97.72% | |||
| Premium.Card.Holder | False: 78.96% | True: 21.04% | 0% | 97.72% | |||
| Presence.Of.Children | False: 50.39% | True: 49.61% | 0% | 97.72% | |||
| Estimated.Income.Code | $50;000-$74;999: 23.17% | $75;000-$99;999: 17.74% | $40;000-$49;999: 11.41% | $30;000-$39;999: 11.23% | $125;000 OR MORE: 10.34% | 0% | 97.78% |
| Home.Market.Value | $75;000-$99;999: 16.61% | $50;000-$74;999: 15.09% | $100;000-$124;999: 13.68% | $125;000-$149;999: 9.12% | $150;000-$174;999: 7.49% | 0% | 98.31% |
| New.Car.Buyer | True: 100% | 0% | 98.97% | ||||
| Vehicle.Lifestyle | IMPORT (STANDARD/ECONOMY): 27.76% | FULL SIZE (STANDARD/LUXURY): 22.64% | TRUCK OR UTILITY VEHICLE: 12.6% | SPECIALTY (MIDSIZE/SMALL): 11.42% | STATION WAGON: 10.83% | 0% | 99% |
| Property.Type | single family dwelling: 86.59% | condo: 7.53% | 2-4 unit(duplex;triplex;quad): 2.35% | misc. residential (condo store/flat): 1.88% | apartment(5+ units): 0.94% | 0% | 99.16% |
| Loan.To.Value.Percent | 0% (NO LOANS): 30.26% | 100-99%: 10.53% | 70-74%: 8.88% | 75-79%: 8.88% | 80-84%: 8.88% | 0% | 99.4% |
| Presence.Of.Pool | False: 98.87% | True: 1.13% | 0% | 97.72% | |||
| Own.Or.Rent.Home | Owner: 93.56% | Renter: 6.44% | 0% | 97.97% | |||
| Mail.Order.Buyer | True: 64.68% | False: 35.32% | 0% | 97.72% | |||
| DMA.No.Mail.Solicitation.Flag | True: 100% | 0% | 97.72% | ||||
| DMA.No.Phone.Solicitation.Flag | True: 100% | 0% | 97.72% | ||||
| New.Bank.Card | False: 100% | 0% | 97.72% | ||||
| Speciality.Store.Retail | False: 84.59% | True: 15.41% | 0% | 97.72% | |||
| Oil.Retail.Activity | False: 90.65% | True: 9.35% | 0% | 97.72% | |||
| Bank.Retail.Activity | False: 77.58% | True: 22.42% | 0% | 97.72% | |||
| Finance.Retail.Activity | False: 90.3% | True: 9.7% | 0% | 97.72% | |||
| Miscellaneous.Retail.Activity | False: 94.37% | True: 5.63% | 0% | 97.72% | |||
| Upscale.Retail | False: 94.37% | True: 5.63% | 0% | 97.72% | |||
| Upscale.Speciality.Retail | False: 96.36% | True: 3.64% | 0% | 97.72% | |||
| Retail.Activity | False: 64.33% | True: 35.67% | 0% | 97.72% | |||
| Dwelling.Size | SINGLE HOUSEHOLD: 75.58% | 2 HOUSEHOLDS: 6.78% | 100+ HOUSEHOLDS: 3.53% | 3 HOUSEHOLDS: 2.41% | 10-19 HOUSEHOLDS: 1.95% | 0% | 97.87% |
| Lendable.Home.Equity | EQUITY LESS THAN OR EQUAL $0: 33.22% | EQUITY $10;000-$19;9999: 9.54% | EQUITY $1-$4;999: 8.55% | EQUITY $75;000-$99;999: 8.55% | EQUITY $100;000-$149;999: 8.22% | 0% | 99.4% |
| Home.Size.Range | 1;250-1;499 FT: 16.04% | 2;000-2;499 FT: 15.41% | 1;000-1;249 FT: 14.47% | 1;500-1;749 FT: 13.84% | 1;750-1;999 FT: 10.69% | 0% | 99.37% |
| Lot.Size.Range | 1 ACRE OR LESS: 89.58% | GREATER THAN 1 ACRE: 10.42% | 0% | 99.43% | |||
| Dwelling.Unit.Size | SINGLE FAMILY DWELLING UNIT: 74.03% | MULTI FAMILY DWELLING UNIT: 25.97% | 0% | 97.81% | |||
| Available.Home.Equity | EQUITY $30;000-$49;000: 19.67% | EQUITY $50;000-$74;000: 18.56% | EQUITY $75;000-$99;999: 11.6% | EQUITY $20;000-$29;000: 11.27% | EQUITY $100;000-$149;999: 11.05% | 0% | 98.21% |
| Minority.Census.Tract | False: 98.61% | True: 1.39% | 0% | 97.72% | |||
| Gender | Female: 83.33% | Male: 16.67% | 0% | 98.43% | |||
| Occupation | PROFESSIONAL/TECHNICAL: 32.33% | HOUSEWIFE: 15.54% | ADMINISTRATIVE/MANAGERIAL: 14.29% | CLERICAL/WHITE COLLAR: 11.28% | STUDENT: 6.02% | 0% | 99.21% |
| Other.Indiv…Gender | Male: 81.91% | Female: 18.09% | 0% | 98.84% | |||
| Other.Indiv…Occupation | PROFESSIONAL/TECHNICAL: 45.9% | ADMINISTRATIVE/MANAGERIAL: 17.93% | CRAFTSMAN/BLUE COLLAR: 13.68% | SALES/SERVICE: 6.38% | CLERICAL/WHITE COLLAR: 4.86% | 0% | 99.35% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| Year.of.Birth | 1979 | 1966.58 | 1966 | 1948 | 7.95 |
| Value.Of.All.Vehicles | 99000 | 19111.56 | 16000 | 1000 | 14190.98 |
| Age | 86 | 37.58 | 36 | 18 | 10.71 |
| Other.Indiv…Age | 86 | 40.90 | 38 | 18 | 11.51 |
| Number.Of.Adults | 6 | 2.54 | 2 | 1 | 1.35 |
| Year.House.Was.Built | 1997 | 1963.50 | 1968 | 1850 | 27.02 |
| Length.Of.Residence | 15 | 6.77 | 6 | 0 | 4.39 |
| Year.Home.Was.Bought | 1999 | 1991.18 | 1993 | 1954 | 6.34 |
| Home.Purchase.Date | 199906 | 199121.06 | 199300 | 195400 | 633.82 |
| Number.Of.Vehicles | 3 | 1.47 | 1 | 1 | 0.61 |
| CRA.Income.Classification | 4 | 3.31 | 3 | 1 | 0.63 |
| Number.Of.Credit.Lines | 9 | 2.72 | 3 | 1 | 1.61 |
| Dataquick.Market.Code | 10 | 4.61 | 4 | 1 | 2.53 |
| Insurance.Expiry.Month | 12 | 6.54 | 6 | 1 | 3.45 |
| Month.Home.Was.Bought | 12 | 7.03 | 7 | 1 | 3.42 |
| Year.Of.Structure | 1999 | 1973.29 | 1980 | 1900 | 26.47 |
| Variable | Top.1st | Top.2nd | Top.3rd | Top.4th | Top.5th | Others | Not.Available |
|---|---|---|---|---|---|---|---|
| BrandName | DKNY: 9.99% | Silk Reflections: 9.21% | ORO: 8.85% | HPK: 7.22% | AME: 7.13% | 0% | 86.16% |
| PrimaryPackage | Bottle: 35.18% | Tube: 30.46% | Jar: 25.55% | Box: 5.4% | Spray: 3.41% | 0% | 96.07% |
| StockType | Replenishable: 60.93% | Seasonal 1: 23.75% | Seasonal 1*: 11.25% | Seasonal 2: 2.07% | Replenishment: 1.83% | 0% | 80.3% |
| ProductForm | Cream: 54.01% | Liquid: 24.19% | gel: 7.58% | Lotion: 7.36% | Capsule: 5.97% | 0% | 96.64% |
| Look | Sheer: 83.41% | Ultra Sheer: 13.1% | Opaque: 3.48% | 0% | 94.22% | ||
| BasicOrFashion | Basic: 92.33% | Fashion: 7.67% | 0% | 86.11% | |||
| MfgStyleCode | Tricot: 2.34% | BC27340: 1.45% | 00N02: 1.35% | 00Q63: 1.34% | 5751: 1.2% | 10.26% | 82.06% |
| SaleOrNonSale | NSALE: 100% | 0% | 94.46% | ||||
| HasDressingRoom | False: 73.56% | True: 26.44% | 0% | 86.09% | |||
| ColorOrScent | Scent: 85.69% | Color: 14.31% | 0% | 99.7% | |||
| Texture | Flat: 66.48% | Textured: 33.52% | 0% | 96.65% | |||
| Manufacturer | Donna Karan Company: 10.79% | Peneco: 9.08% | HAN: 8.66% | Kneipp: 6.61% | Paul Lavitt Mills Inc.: 6.54% | 0% | 80.12% |
| ToeFeature | SF: 86.04% | RT: 13.96% | 0% | 93.49% | |||
| Category2 | Gift Sets & Special Items: 32.01% | Skincare: 23.28% | Cellulite & Other Treatments: 22.82% | Footcare: 14.47% | Health Supplements: 6.2% | 0% | 99.21% |
| Material | Cotton: 66.35% | Nylon: 23.62% | Coolmax: 3.71% | Rayon: 1.49% | Lycra: 1.04% | 0% | 93.18% |
| CategoryCode | PH: 33.89% | WDCS: 12.66% | TH: 7.83% | FO: 6.76% | TT: 6.2% | 0% | 86.09% |
| WaistControl | CT: 76.17% | STW: 23.83% | 0% | 94.34% | |||
| Collection | Oroblu Italian Hosiery: 6.31% | Conversationals: 5.17% | DKNY Skin: 4.41% | Action Pack 3-Pair: 3.9% | Womens Dance: 3.84% | 0% | 86.96% |
| BodyFeature | MBC: 64.5% | UBC: 17.51% | LBC: 11.17% | BS: 6.82% | 0% | 98.49% | |
| Audience | Women: 80.86% | Men: 10.21% | Children: 8.93% | 0% | 86.09% | ||
| Category1 | Skincare: 60.62% | Footcare: 18.05% | Cellulite & Other Treatments: 15% | Hair Removal: 3.89% | Health Supplements: 1.83% | 0% | 94% |
| Product | Cellulite Trimming Gel: 3.25% | Body Lotion - Oceanic Minerals: 2.74% | Kit-Firming Cream/Slimming Cream/Shorts: 2.57% | Body Silk: 2.41% | Herbal Foot Balm: 2.31% | 0% | 94% |
| Pattern | Solid: 58% | Conversational: 39.12% | Floral: 2.03% | Stripe: 0.55% | Herringbone: 0.18% | 0% | 93.72% |
| Variable | Max | Mean | Median | Min | SD |
|---|---|---|---|---|---|
| UnitsPerInnerBox | 12.0 | 4.44 | 3.00 | 1.00 | 2.91 |
| Depth | 16.0 | 2.73 | 2.50 | 0.50 | 2.15 |
| VendorMinREOrderDollars | 500.0 | 161.15 | 150.00 | 100.00 | 82.29 |
| Height | 8.5 | 1.18 | 0.75 | 0.25 | 1.20 |
| UnitsPerOuterBox | 144.0 | 18.57 | 12.00 | 4.00 | 16.83 |
| Pack | 3.0 | 1.14 | 1.00 | 1.00 | 0.50 |
| Length | 16.5 | 9.12 | 9.25 | 3.50 | 1.65 |
| MinQty | 144.0 | 16.48 | 6.00 | 0.00 | 30.45 |
| LeadTime | 28.0 | 10.95 | 10.00 | 1.00 | 6.41 |
| Weight | 40.0 | 4.98 | 2.60 | 0.40 | 5.59 |
| Width | 18.0 | 5.68 | 6.25 | 0.50 | 1.93 |
| UnitIncrement | 36.0 | 4.81 | 3.00 | 1.00 | 4.38 |